Guidelines for evaluating criterion-referenced tests and test manuals are proposed and applied to a sample of popular commercially published tests. The well-known Test Standards published by a joint committee of professional societies is helpful, though not completely applicable, and was used together with other sources in the preparation of an evaluation form. This form is designed to be useful to both users and developers of criterion-referenced tests. The 39 guideline questions were applied to 11 tests. Among the common weaknesses found were: (1) lack of domain specifications; (2) no indication of the qualifications of the individuals who prepared the test objectives; (3) possible content bias due to the use of item analysis in test construction; (4) inadequate information about test reliability; (5) lack of information about the rationale for cutting scores; (6) not enough information about error in test scores; and (7) no information about factors affecting the validity of scores. Suggestions for improving the guidelines are encouraged.
Most of the major test publishers have published in the last few years a wide assortment of criterion-referenced tests. In addition, many school districts, state agencies, small testing firms, and consulting firms have produced their own criterion-referenced tests.

Criterion-referenced tests are designed to address many problem areas. For example, criterion-referenced tests are being used to monitor student progress through school programs, to diagnose learning disabilities, to report student progress to parents, to evaluate various types of programs, and to certify or license professionals in many fields. Unfortunately, it appears to us, and to many users of criterion-referenced tests we have spoken with, that many of the available tests fall short of the technical quality necessary for them to accomplish their intended purposes. Perhaps one explanation is that many criterion-referenced tests were developed before an adequate testing technology was fully explicated. Fortunately, there now exists an adequate technology for constructing criterion-referenced tests and using criterion-referenced test scores (Hambleton and Eignor, 1978; Hambleton, Swaminathan, Algina, and Coulson, 1978; Popham, 1978). Another possible explanation is that there has been a shortage of guidelines for constructing and using criterion-referenced tests. Certainly the well-known Test Standards for evaluating tests and test manuals prepared by a joint committee of AERA/APA/NCME is helpful, but it is not completely applicable to criterion-referenced tests. Besides the incompleteness of the AERA/APA/NCME Test Standards for evaluating criterion-referenced tests and test manuals, what relevant information there is, is scattered through 75 pages or so of other materials appropriate for norm-referenced test evaluations. Therefore, the Test Standards, in its present form, is not very useful for individuals interested in evaluating criterion-referenced tests.

¹Paper presented at the annual meeting of NCME, Toronto, 1978.

²Laboratory of Psychometric and Evaluative Research Report No. 73. Amherst, MA: School of Education, University of Massachusetts, Amherst, 1978.



The primary purpose of this paper is to propose a set of guidelines for evaluating criterion-referenced tests and test manuals. The guidelines should be useful to both users and developers of criterion-referenced tests. Test standards are not offered in the paper (an example of a standard is: "test score reliability must exceed 20"); but we do offer a set of questions for consideration by potential users and developers of criterion-referenced tests. The only other efforts we are aware of to develop guidelines for evaluating criterion-referenced tests and test manuals are Popham (1978, Chapter 8), Swezey and Pearlstein (1975), and Walker (1977). A secondary purpose is to report on our use of the guidelines with eleven commercially available criterion-referenced test batteries.


One caution and one comment seem appropriate to introduce at this point. The guidelines represent our own biases about what is important technical information for users to have in making informed decisions about the quality of criterion-referenced tests. Also, in this paper we did not provide (1) a rationale for the inclusion of each guideline, and (2) specifics on how the guidelines can be applied. Interested readers are encouraged to read Eignor (1978) and Hambleton and Eignor (1978) for the information.


A Proposed Set of Guidelines

The list of guidelines was generated by placing ourselves in the role of potential purchasers of a criterion-referenced test, and asking "What questions would we want to answer before making a decision to use a criterion-referenced test in a particular situation?" Questions were organized around ten broad categories¹. They are: Objectives, Test Items, Administration, Test Layout, Reliability, Cut-off Scores, Validity, Norms, Reporting of Test Score Information, and Test Score Interpretations. The questions are as follows:

A. Objectives

A.1 Is the purpose (or purposes) of the test stated in a clear and concise fashion?

A.2 Is each objective clearly written so that it is possible to identify an "item pool"?

A.3 Is it clear from the list of objectives what the test measures?

A.4 Is an appropriate rationale offered for including each objective in the test?

A.5 Can a potential user "tailor" the test to meet local needs by determining which objectives from a pool of objectives offered by the publisher are to be measured by the test?

A.6 Is there a match between the content measured by the test and the situation where the test is to be used?

A.7 Are individuals identified who were responsible for the preparation of objectives?

A.8 Does the set of objectives measured by the test serve as a representative set from some content domain of interest?

¹The very important factors of cost and time limits are not considered here, but they are included in our evaluation form.
B. Test Items

B.1 Is the item review process described?

B.2 Are the test items valid indicators of the objectives they were developed to measure?

B.3 Is the set of test items measuring an objective representative of the "pool" of items measuring the objective?

B.4 Are the items free of technical flaws?

B.5 Are the test items in an appropriate format to measure the objectives they were developed to measure?

B.6 Are the test items free of bias (for example, sex, ethnic, or racial)?

B.7 Was a heterogeneous sample of examinees employed in piloting the test items?

B.8 Was the item analysis data used only to detect "flawed" items?


C. Administration

C.1 Do the test directions include information relative to test purpose, time limits, practice questions, answer sheets, and scoring?

C.2 Are the test directions clear?

C.3 Is the test easy to score?

C.4 Does the test manual specify an examiner's role and responsibilities?

D. Test Layout

D.1 Is the layout of the test booklets attractive?

D.2 Is the layout of the test booklets convenient for examinees?

E. Reliability

E.1 Is the type of reliability information offered in the test manual appropriate for the intended use (or uses) of the scores?

E.2 Was the sample (or samples) of examinees used in the reliability study adequate in size, and representative of the population for whom the test is intended?

E.3 Are test lengths suitable to produce tests with desirable levels of test score reliability?

E.4 Is reliability information offered in the test manual for each intended use (or uses) of the test scores?

F. Cut-Off Scores

F.1 Was a rationale offered for the selection of a method for determining cut-off scores?

F.2 Was the procedure for implementing the method explained, and was it appropriate?

F.3 Was evidence for the validity of the chosen cut-off score (or cut-off scores) offered?

G. Validity

G.1 Does the validity evidence offered in the test manual address adequately the intended use (or uses) of scores obtained from the test?

G.2 Is an appropriate discussion of factors affecting the validity of test scores offered in the test manual?

H. Norms

H.1 Are the norms data reported in an appropriate form?

H.2 Are the samples of examinees utilized in the norming study described?

H.3 Are appropriate cautions introduced for proper test score interpretations?



I. Reporting of Test Score Information

I.1 Are the test scores reported for examinees on an objective by objective basis?

I.2 Are there multiple options available to the user for reporting of test results (for example, by class and grade within a school)?

I.3 Are convenient procedures available for scoring tests by hand, and forms available for reporting test score information?

J. Test Score Interpretations

J.1 Are suitable cautions included in the manual for interpreting individual and group objective score information?

J.2 Are appropriate guidelines offered in the manual for utilizing test scores to make descriptive statements, instructional decisions, program evaluation decisions, or other stated uses of the test scores?

A convenient rating form is given on the next four pages.
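
For readers who want to keep ratings in machine-readable form, the ten categories and the four answers on the rating form can be sketched as a small data structure. This is only an illustrative sketch: the names `CATEGORIES`, `ANSWERS`, `question_ids`, and `RatingForm` are hypothetical and are not part of the guidelines themselves. The question counts per category are taken from the list above.

```python
from dataclasses import dataclass, field

# Category letters, names, and question counts as listed in the guidelines.
CATEGORIES = {
    "A": ("Objectives", 8),
    "B": ("Test Items", 8),
    "C": ("Administration", 4),
    "D": ("Test Layout", 2),
    "E": ("Reliability", 4),
    "F": ("Cut-off Scores", 3),
    "G": ("Validity", 2),
    "H": ("Norms", 3),
    "I": ("Reporting of Test Score Information", 3),
    "J": ("Test Score Interpretations", 2),
}

# The four answers allowed on the rating form.
ANSWERS = ("Acceptable", "Unacceptable", "Unsure", "Not Applicable")


def question_ids():
    """Return every question identifier (for example 'A.1') in form order."""
    return [
        f"{letter}.{number}"
        for letter, (_, count) in CATEGORIES.items()
        for number in range(1, count + 1)
    ]


@dataclass
class RatingForm:
    """Hypothetical container for one reviewer's ratings of one test."""

    test_name: str
    ratings: dict = field(default_factory=dict)
    comments: dict = field(default_factory=dict)

    def rate(self, question_id, answer, comment=""):
        """Record one answer, rejecting unknown questions or answers."""
        if question_id not in question_ids():
            raise KeyError(f"unknown question: {question_id}")
        if answer not in ANSWERS:
            raise ValueError(f"answer must be one of {ANSWERS}")
        self.ratings[question_id] = answer
        if comment:
            self.comments[question_id] = comment

    def unanswered(self):
        """List the questions that have not been rated yet."""
        return [q for q in question_ids() if q not in self.ratings]


form = RatingForm("Example test")  # made-up test name
form.rate("F.3", "Unacceptable", "No evidence offered for the cut-off score.")
print(len(question_ids()), "questions;", len(form.unanswered()), "still unanswered")
```

A structure of this kind makes it easy to confirm that every question has been answered before ratings from several tests are compared.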




Evaluation of Eleven Criterion-Referenced Tests

Eleven of the more popular criterion-referenced tests were selected for review. The names of the tests and some descriptive information are presented in the chart.

Our primary purpose was to ascertain the extent to which these tests met our guidelines. We have reported our evaluation of each test relative to each guideline, but the more important information is arrived at by determining how well the tests as a group meet each of our guidelines. The group information is informative because it helps to pinpoint areas where commercial materials are in need of revisions and further development.
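
The group-level view described above can be sketched as a simple tally of ratings per guideline. The sketch below is an assumption-laden illustration, not the procedure used in this paper: the short rating codes, the function names, and the ratings themselves are all made up, and the five rating levels mirror the rating scale given later in the paper.

```python
from collections import Counter

# Hypothetical short codes for the five rating levels used in the review.
ACCEPTABLE = {"acc", "acc_res"}          # acceptable; acceptable with reservations
UNACCEPTABLE = {"bad_data", "no_data"}   # unsuitable data; no data offered
NOT_APPLICABLE = "na"


def group_summary(ratings_by_guideline):
    """For each guideline, count the ratings and compute the share of
    rated tests (those not marked not-applicable) judged acceptable."""
    summary = {}
    for guideline, row in ratings_by_guideline.items():
        rated = [r for r in row if r in ACCEPTABLE or r in UNACCEPTABLE]
        share = sum(r in ACCEPTABLE for r in rated) / len(rated) if rated else None
        summary[guideline] = {"counts": Counter(row), "share_acceptable": share}
    return summary


def weakest_guidelines(summary, threshold=0.5):
    """Guidelines on which fewer than `threshold` of the rated tests were acceptable."""
    return sorted(
        g for g, s in summary.items()
        if s["share_acceptable"] is not None and s["share_acceptable"] < threshold
    )


# Made-up ratings for three guidelines across four tests (not the paper's data).
example = {
    "C.2": ["acc", "acc", "acc_res", "acc"],
    "E.1": ["no_data", "bad_data", "no_data", "acc"],
    "H.1": ["acc", "na", "na", "na"],
}
summary = group_summary(example)
print(weakest_guidelines(summary))
```

Excluding not-applicable ratings from the denominator is a design choice made here for illustration; other conventions are possible.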
Manual Evaluation Form

Background Information

Test Name:                        Forms and Levels:
Test Publisher:                   Author(s):
Year of Publication:              Cost:
Reusable Booklets:  Yes   No
Special Test Administration Conditions:
Manual and Other Technical Aids:

For each of the questions below there are four possible answers: "Acceptable", "Unacceptable", "Unsure", and "Not Applicable". Place a "✓" in the column corresponding to your answer to each question.

Question    Ratings (Acceptable / Unacceptable / Unsure / Not Applicable)    Comments

A.1. Is the purpose (or purposes) of the test stated in a clear and concise fashion?

A.2. Is each objective clearly written so that it is possible to identify an "item pool"?

A.3. Is it clear from the list of objectives what the test measures?

A.4. Is an appropriate rationale offered for including each objective in the test?

A.5. Can a user "tailor" the test to meet local needs by selecting objectives from a pool of available objectives?

A.6. Is there a match between the content measured by the test and the situation where the test is to be used?


A.7. Are individuals identified who were responsible for the preparation of objectives?

A.8. Does the set of objectives measured by the test serve as a representative set from some content domain of interest?

B.1. Is the item review process described?

B.2. Are the test items valid indicators of the objectives they were developed to measure?

B.3. Is the set of test items measuring an objective representative of the "pool" of items measuring the objective?

B.4. Are the items free of technical flaws?

B.5. Are the test items in an appropriate format to measure the objectives they were developed to measure?

B.6. Are the test items free of bias (for example, sex, ethnic, or racial)?

B.7. Was a heterogeneous sample of examinees employed in piloting the test items?

B.8. Was the item analysis data used only to detect "flawed" items?

C.1. Do the test directions include information relative to test purpose, time limits, practice questions, answer sheets, and scoring?


C.2. Are the test directions clear?

C.3. Is the test easy to score?

C.4. Does the test manual specify an examiner's role and responsibilities?

D.1. Is the layout of the test booklets attractive?

D.2. Is the layout of the test booklets convenient for examinees?

E.1. Is the type of reliability information offered in the test manual appropriate for the intended use (or uses) of the scores?

E.2. Was the sample of examinees adequate in size, and representative of the population for whom the test is intended?

E.3. Are test lengths suitable to produce tests with desirable levels of test score reliability?

E.4. Is reliability information offered in the test manual for each intended use (or uses) of the test scores?

F.1. Was a rationale offered for the selection of a method for determining cut-off scores?

F.2. Was the procedure for implementing the method explained, and was it appropriate?


F.3. Was evidence for the validity of the chosen cut-off score (or cut-off scores) offered?

G.1. Does the validity evidence offered in the test manual address adequately the intended use (or uses) of scores obtained from the test?

G.2. Is an appropriate discussion of factors affecting the validity of test scores offered in the test manual?

H.1. Are the norms data reported in an appropriate form?

H.2. Are the samples of examinees utilized in the norming study described?

H.3. Are appropriate cautions introduced for proper test score interpretations?

I.1. Are the test scores reported for examinees on an objective by objective basis?

I.2. Are there multiple options available to the user for reporting of test results (for example, by class and grade within a school)?

I.3. Are convenient procedures available for scoring tests by hand, and forms available for reporting test score information?

J.1. Are suitable cautions included in the manual for interpreting individual and group objective score information?

J.2. Are appropriate guidelines offered for utilizing test scores to accomplish stated purposes?


Code  Name of Test                                              Grades       Levels       Forms        Publication Date  Publisher
1     1976 Stanford Diagnostic Mathematics Test                 1-12         4            2            1976              Harcourt Brace Jovanovich
2     1976 Stanford Diagnostic Reading Test                     1-12         4            2            1976              Harcourt Brace Jovanovich
3     Skills Monitoring System-Reading                          3-5          [illegible]  1            1975              Harcourt Brace Jovanovich
4     Individual Pupil Monitoring System-Mathematics            1-6          6            2            1974              Houghton-Mifflin
5     Individual Pupil Monitoring System-Reading                1-8          8            2            1974              Houghton-Mifflin
6     Diagnostic Mathematics Inventory                          [illegible]  [illegible]  1            1977              CTB/McGraw-Hill
7     Prescriptive Reading Inventory                            K-6.5        6            [illegible]  1977              CTB/McGraw-Hill
8     Diagnosis: An Instructional Aid-Mathematics and Reading   1-6          2            2            1974              Science Research Associates
9     Mastery: An Evaluation Tool-SOBAR Reading                 K-9          10           2            1975              Science Research Associates
10    Mastery: An Evaluation Tool-Mathematics                   K-8          9            2            1974              Science Research Associates
11    Fountain Valley Support System in Mathematics             K-8          9            1            1974              Richard L. Zweig Associates
In judging the quality of a test and test manual relative to each guideline, the following rating scale was used:

A  = Acceptable
A- = Acceptable, with reservations
X  = Unacceptable, data offered was unsuitable or improperly used
Y  = Unacceptable, no data was offered
N  = Not Applicable

Our most significant impressions of the tests and test manuals reviewed are as follows:

1. In areas such as Administration, Test Layout, and Norms, there are few problems.

2. Current commercially available "criterion-referenced tests" reviewed in this paper should be called "objectives-referenced tests" since the tests appear to be developed from behavioral objectives (Popham, 1978). Starting to develop a test from a listing of behavioral objectives is less than ideal because behavioral objectives usually do not lead to unambiguous definitions of the "item pools" keyed to the behavioral objectives. The solution is to write "domain specifications" (Popham, 1978).

3. Only about half of the publishers included information about the qualifications of individuals who prepared the objectives measured by their test. The qualifications of participants in this aspect of the test development process are important information for potential users.



Table 1

Summary of Ratings of the Criterion-Referenced Tests



[Table 1 body: ratings of Tests 1 through 11 (columns) on Questions A1 through J2 (rows). The individual cell entries are not legible in this copy.]


¹We did not have the proper materials to assess the quality of the test in the areas marked by a "?".

²The information was on a cassette. We did not listen to the tape and so we were not in a position to rate this aspect of the test.
4. Since test developers have not used "domain specifications", it is impossible to assess "item representativeness". Item representativeness is essential if users desire to use objective scores to "generalize to the domains of behaviors defined by the objectives." If item representativeness is not established, scores can only be interpreted in terms of the specific items included in the test.

5. "Item analysis" is an area in which there are two problems: (a) too little explanation is offered of the choice of particular item statistics and of the specifics of item statistics usage, and (b) item statistics are used in test construction, thereby "biasing" the content validity of the test in unknown ways.

6. Test score reliability was not handled very well in most of the manuals. Either (a) inappropriate information relative to the stated uses of the test scores was offered, or (b) no information was offered.

7. Cut-off scores are typically offered, but there is no rationale offered for setting cut-off scores. Procedures used for setting cut-off scores are not explained, nor is any evidence offered for the "validity" of cut-off scores (for example, do those examinees classified as "masters" typically perform better than "non-masters" on some appropriately chosen external criterion measure?).

8. Factors affecting the validity of scores are not offered in any of the manuals.

9. Only a few of the manuals introduced the notion of "error" in test scores. It is extremely important for users to have some indication of the "stability" of their objective scores and/or "consistency of mastery/non-mastery decisions".
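
Two of the points above lend themselves to a small worked illustration: the comparison of "masters" and "non-masters" on an external criterion (point 7) and the consistency of mastery/non-mastery decisions (point 9). The sketch below is not a procedure taken from this paper. It assumes two administrations of the same objective, uses simple proportion agreement as the consistency index (one of several possible indices), and uses made-up scores, a made-up cut-off, and hypothetical function names.

```python
from statistics import mean


def classify(scores, cutoff):
    """Label each objective score 'master' if it meets the cut-off, else 'non-master'."""
    return ["master" if s >= cutoff else "non-master" for s in scores]


def decision_consistency(first, second, cutoff):
    """Proportion of examinees given the same mastery decision on two administrations."""
    if len(first) != len(second):
        raise ValueError("both administrations must cover the same examinees")
    a, b = classify(first, cutoff), classify(second, cutoff)
    return sum(x == y for x, y in zip(a, b)) / len(a)


def criterion_gap(scores, criterion, cutoff):
    """Mean external-criterion score of masters minus that of non-masters."""
    labels = classify(scores, cutoff)
    masters = [c for c, label in zip(criterion, labels) if label == "master"]
    non_masters = [c for c, label in zip(criterion, labels) if label == "non-master"]
    if not masters or not non_masters:
        raise ValueError("need at least one master and one non-master")
    return mean(masters) - mean(non_masters)


# Made-up data: six examinees, a five-item objective, cut-off of 4 items correct.
first_admin = [5, 3, 4, 2, 5, 1]
second_admin = [4, 4, 3, 2, 5, 2]
external = [80, 60, 70, 50, 90, 40]  # hypothetical external criterion scores

consistency = decision_consistency(first_admin, second_admin, cutoff=4)
gap = criterion_gap(first_admin, external, cutoff=4)
print(f"decision consistency = {consistency:.2f}, criterion gap = {gap}")
```

In this made-up example four of the six examinees receive the same decision on both administrations, and masters average 30 points higher than non-masters on the external measure. A manual reporting figures of this kind would give users some basis for judging the error in mastery decisions and the reasonableness of the cut-off score.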



Concluding Remarks

Our proposed guidelines were developed after careful study of the criterion-referenced testing literature and the Test Standards. However, they are offered here only to serve as a "catalyst" for further discussion and debate on a topic of considerable importance to the test and measurement field. Our use of the proposed guidelines to evaluate eleven criterion-referenced tests was intended to (1) demonstrate that the proposed guidelines were workable, and (2) highlight areas where considerably more (or different) work on the part of test developers is needed.

Our goal for preparing this paper has been accomplished if (1) it stimulates others to extend and improve upon our guidelines, and (2) it helps to direct test developers toward more acceptable practices of criterion-referenced test construction and preparation of test manuals.

Individuals with suggestions for improving the guidelines are encouraged to write the authors.



Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B.